NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part1 / 6108 < prev next >

Wrap

Internet Message Format | 1996-08-05 | 3.0 KB

Path: po.CWRU.Edu!mab22 From: mab22@po.CWRU.Edu (Michael A. Balfour) Newsgroups: comp.lang.c Subject: Re: unique id for a string Date: 22 Feb 1996 19:37:19 GMT Organization: Case Western Reserve University, Cleveland, OH (USA) Message-ID: <4giglf$99h@madeline.INS.CWRU.Edu> References: <1996Feb16.175601.114182@kuhub.cc.ukans.edu> <harmon.824680556@pegasus.montclair.edu> Reply-To: mab22@po.CWRU.Edu (Michael A. Balfour) NNTP-Posting-Host: kanga.ins.cwru.edu In a previous article, harmon@pegasus.montclair.edu (Derek Harmon) says: >anh@kuhub.cc.ukans.edu writes: >>This is not a quite a C problem. But since the implementation will be in >>C anyway. > >>Another way is to simply assign a id to a word and store the word in a >>binary search tree, and therefore, it is relatively fast to make sure >>each word has an unique id. But I think it is still to slow. We are >>talking about a possible 200,000+ words. > > How balanced the tree would be depends on either the order of input, or >the tree's self-balancing of itself which makes input longer, but logarith- >mically lowers the bound on search time. But we are talking about more than >200,000 words here, and that is where the problem occurs. If the max length >of these strings is 32 characters, then we are talking about roughly 7.3 x >10^43 different strings. That will not fit into a 64-bit long. :) > Another possibility to approach is a hash table. A method I've been examining for doing searches on strings instantly without even storing the list of strings involves minimal perfect hash functions. The general idea is that every string you're using as input hashes to a specific unique value. Using this technique, you just hash the string to be searched on, and use the hash value to index into an array of return values. A good paper on the topic can be found at ftp.cs.uq.oz.au. The paper is "An optimal algorithm for generating minimal perfect hash functions" written by Zbigniew Czech, George Havas, and Bohdan Majewski. Mike Balfour > The formula you give above will give all those strings distinct id nums, >if you have knowledge of the probability of occurence of symbols from that >language, you can shift the probability of smaller id num size (there will >still be large id nums) for a random sampling of those strings. However, >if symbols have equal probabilities of selection, finnagling of the formula >won't add any advantage. If it were possible, the world would be a path to >your door for an infinite data compression method. > > -- Stone >-- ># Derek Harmon (aka Stonelight) harmon@pegasus.montclair.edu ># - Computer Science Undergrad, Montclair State University, NJ ># - My views are my own, nobody else is this creative. 3;)> >... Recursive. adj., see Recursive. > > -- ----------------------------------+-------------------------------- Mike Balfour, Partner | BS/MS Graduate - ECMP Overload Engineering | Case Western Reserve University "New Ideas for a Brighter Future" | Cleveland, OH